THE RELATIVE EFFICIENCY OF TWO TESTS AS A FUNCTION OF ABILITY LEVEL



Abstract 

A new formula is developed for the relative efficiency of two tests
measuring the same trait. The formula expresses relative efficiency solely
in terms of the standard errors of measurement and, surprisingly, the
frequency distributions of true scores. Approximate methods for estimating
relative efficiency may make this function routinely available. A numerical
illustration compares new and old estimates of relative efficiency for
subtests from the Scholastic Aptitude Test.



THE RELATIVE EFFICIENCY OF TWO TESTS AS A FUNCTION OF ABILITY LEVEL* 

Birnbaum [1968] defines the relative efficiency of two testing proce- 
dures as the ratio of their information functions. Their relative efficiency 
will vary for different levels of the trait measured. Ideally, test manuals 
should report information functions or relative efficiencies as routinely 
as they now report reliability coefficients. 

The main purpose of the present note is to derive a useful and instruc- 
tive formula for relative efficiency, appropriate for two unidimensional 
tests measuring the same trait. It is necessary that the two tests be 
administered either to the same group or to approximately equivalent groups 
of examinees. The new formula shows that relative efficiency is closely
related to the shapes of the true-score distributions of the two tests. 

The first section briefly discusses information functions. The second
section derives the new formula. The third section presents a method of
practical application and an empirical check. 

1. Information Function

A testing procedure produces a score x for each testee, presumed to
be related to his standing on the trait $\theta$, hereafter called the "ability,"
measured by the procedure. The score x may be the number of questions
answered correctly, or it may be a complicated function of the examinee's
responses. If x were a consistent, preferably an unbiased, estimator of
$\theta$, and if $\theta$ were uniquely defined, the testing and scoring procedure
could perhaps be evaluated by its sampling variance. Scores commonly

*Research reported in this paper has been supported by grant GB-32781X
from National Science Foundation.



used (because of their convenience) are typically consistent estimators
of some awkward function of $\theta$, however. Worse yet, this function is seldom
the same from one procedure to the next, except for the case, uninteresting
for making comparisons, when the two procedures are strictly parallel.
This situation usually causes no problems for the mental tester who is
interested only in the relative standings of the examinees on $\theta$. For
him, within limits, one monotonic function of $\theta$ is about as good as
another. This situation does prevent us, however, from comparing testing
and scoring procedures simply in terms of the sampling variance of the
score.

Birnbaum [1968, p. kl%] suggests comparing scoring procedures by the
widths of their asymptotic confidence intervals for $\theta$. (In this discussion,
"asymptotic" indicates that the number, n, of test items is
large.) This width is inversely proportional to the square root of

$$I\{\theta, x\} \equiv \frac{[\partial \mathcal{E}(x \mid \theta)/\partial\theta]^2}{\operatorname{Var}(x \mid \theta)}, \tag{1}$$

termed the score information function. An alternative, nonasymptotic line
of reasoning leading to this same function has been outlined by Lord
[1952, eq. 57; 1971, eq. 6.5].

A few remarks about information functions will be listed below:

1. In classical test theory, if x is a linear composite of item
scores, lengthening the test k-fold will multiply the mean of x by k.
Since $\operatorname{Var}(x \mid \theta) = \operatorname{Var}[x - \mathcal{E}(x \mid \theta) \mid \theta]$ represents the variance of the
errors of measurement, this quantity will be multiplied by k also (not
by $k^2$). Thus, lengthening the test k-fold will multiply the score
information function by k. Conversely, a percent increase in a score
information function is most easily interpretable as equivalent to the
increase achieved by lengthening a conventional test by the same percentage.

2. If x is the maximum likelihood estimator $\hat\theta$, then $I\{\theta, x\} \equiv I\{\theta, \hat\theta\}$
is asymptotically equal to the Fisher information measure

$$I_F\{\theta, u\} \equiv \mathcal{E}\left\{\left[\frac{\partial \log L(u \mid \theta)}{\partial\theta}\right]^2 \,\Big|\, \theta\right\},$$

where $L(u \mid \theta)$ is the likelihood function for the vector u of observed
item responses [Birnbaum, 1968, 20.3]. Also, $I\{\theta, \hat\theta\}$ is equal to the
reciprocal of the asymptotic variance of $\hat\theta$.

3. A nonasymptotic line of reasoning given by Rao [1965, pp. 270-1]
suggests the use, even for small n, of

$$I_F\{\theta, x\} \equiv \mathcal{E}\left\{\left[\frac{\partial \log L(x \mid \theta)}{\partial\theta}\right]^2 \,\Big|\, \theta\right\} \tag{2}$$

(where $L(x \mid \theta)$ is the likelihood function for the score x) as a measure
of the information about $\theta$ contained in x. This Fisher
information measure is necessarily less than or equal to the one given in
the preceding paragraph. By virtue of the Cramér-Rao lower bound to the
variance, we have under regularity conditions

$$\operatorname{Var}(x \mid \theta) \ge \frac{[\partial \mathcal{E}(x \mid \theta)/\partial\theta]^2}{I_F\{\theta, x\}}.$$

Consequently, $I_F\{\theta, u\} \ge I_F\{\theta, x\} \ge I\{\theta, x\}$. If x is a sufficient
statistic for $\theta$, the equality signs hold asymptotically [Kendall &
Stuart, 1961, 17.37].



4. A linear transformation of x does not affect $I\{\theta, x\}$, but a
nonlinear transformation changes $I\{\theta, x\}$. Asymptotically, the effect of
a strictly monotonic nonlinear transformation is negligible under mild
conditions.

5. A strictly monotonic nonlinear transformation of x has no effect
on the information statistic (2) suggested by Rao, even in small samples,
since the likelihood of a sample of observations is not affected by the choice
of scoring system. This is a very desirable property, in view of the fact
that the choice of a score x, rather than some function of x, is largely
arbitrary. Rao's information measure leads to a very complicated formula,
however, when x is the number-right score. For this reason, it will not
be utilized here.

6. Let $\theta^* \equiv \theta^*(\theta)$ be a strictly monotonic transformation of the ability
scale. It is easily found from the chain rule for differentiation that

$$I\{\theta^*, x\} = I\{\theta, x\}\,(\partial\theta^*/\partial\theta)^{-2}, \tag{3}$$
$$I_F\{\theta^*, x\} = I_F\{\theta, x\}\,(\partial\theta^*/\partial\theta)^{-2}. \tag{4}$$

Thus the shape of the information function may be distorted to any continuous
single-valued curve by choice of $\theta^*$. In particular, the ability
level at which maximum information is obtained may be drastically changed
by a transformation of the ability scale.

7. It is seen from (3) that the relative efficiency of measuring
procedures x and y is not changed by a strictly monotonic transformation
of the ability scale. For this reason, the parameter will be omitted from
the corresponding symbol:

$$\text{R.E.}\{y, x\} \equiv \frac{I\{\theta, y\}}{I\{\theta, x\}} = \frac{I\{\theta^*, y\}}{I\{\theta^*, x\}}. \tag{5}$$



8. Unless we are prepared to defend strongly a particular choice of
metric for ability, it will be wise in any practical investigation to
present R.E. curves rather than the protean information curves. If desired,
an actual measurement procedure can be compared in efficiency to a hypothetical
"standard" test composed of statistically equivalent items with
specified item parameters, or to a hypothetical standard test characterized
by a uniform distribution of item difficulties [Brogden, 1957, p. 303].
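Two of the remarks above lend themselves to a quick numerical check: that lengthening a test k-fold multiplies the score information function by k, and that a strictly monotonic change of the ability scale rescales the information function while leaving the relative efficiency untouched. The following is only a minimal sketch under stated assumptions: the logistic item characteristic function, the item parameters, the transformation $\theta^* = e^{\theta}$, and all function names are hypothetical choices made for illustration, and the score is taken to be a number-right score on locally independent items.

```python
import math

def icc(theta, a, b):
    """Logistic item characteristic function (assumed form; a = slope, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mean_and_variance(theta, items):
    """Conditional mean and error variance of a number-right score at ability theta."""
    probs = [icc(theta, a, b) for a, b in items]
    return sum(probs), sum(p * (1.0 - p) for p in probs)

def information(value, to_theta, items, step=1e-5):
    """Squared slope of the conditional mean over the error variance.

    `value` is a point on an arbitrary ability scale and `to_theta` maps that
    scale back to theta; the slope is taken by central differences.
    """
    upper, _ = mean_and_variance(to_theta(value + step), items)
    lower, _ = mean_and_variance(to_theta(value - step), items)
    slope = (upper - lower) / (2.0 * step)
    _, variance = mean_and_variance(to_theta(value), items)
    return slope ** 2 / variance

def identity(t):
    return t

# Hypothetical item parameters (a, b) for two short tests x and y.
x_items = [(1.0, -0.5), (1.0, 0.0), (1.0, 0.5)]
y_items = [(0.8, -2.0), (1.3, 0.0), (1.1, 2.0)]
theta = 0.4

# Lengthening test x k-fold with statistically equivalent items.
k = 3
info_x = information(theta, identity, x_items)
info_x_long = information(theta, identity, x_items * k)
lengthening_ratio = info_x_long / info_x

# Transformed ability scale theta* = exp(theta), so d(theta*)/d(theta) = exp(theta).
theta_star = math.exp(theta)
info_y = information(theta, identity, y_items)
info_x_star = information(theta_star, math.log, x_items)
info_y_star = information(theta_star, math.log, y_items)

re_theta = info_y / info_x            # relative efficiency on the theta scale
re_star = info_y_star / info_x_star   # relative efficiency on the theta* scale

print(lengthening_ratio)
print(info_x_star, info_x * math.exp(theta) ** -2)
print(re_theta, re_star)
```

The first printed value should be close to k, the second line should show two nearly equal numbers, and the third line should show the same relative efficiency on both scales, up to the error of the numerical differentiation.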



The relative efficiency of two scores, x and y, is ordinarily
computed from their score information functions by (5). As an illustration,
consider the case of number-right scores.

For this special case, we have

$$x = \sum_{i=1}^{n} u_i,$$

where $u_i = 1$ or 0 represents a right or a wrong answer to item i. Thus

$$\mathcal{E}(x \mid \theta) = \sum_{i=1}^{n} P_i, \qquad \operatorname{Var}(x \mid \theta) = \sum_{i=1}^{n} P_i Q_i, \tag{6}$$

where $P_i \equiv P_i(\theta)$ is the characteristic function of item i and
$Q_i \equiv 1 - P_i$. Thus, from (1), the score information function of x for
$\theta$ is [Birnbaum, 1968, eq. 20.2.2]

$$I\{\theta, x\} = \frac{\left(\sum_{i=1}^{n} P_i'\right)^2}{\sum_{i=1}^{n} P_i Q_i}, \tag{7}$$

where $P_i' \equiv \partial P_i/\partial\theta$. In order to estimate relative efficiencies, it
has until now seemed necessary to estimate the item characteristic functions
$P_i(\theta)$ for all $n_x$ and $n_y$ items.

2. A New Formula for Relative Efficiency

Let us now derive a new formula for relative efficiency. We no
longer require x to be a number-right score.

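By definition, $\xi \equiv \mathcal{E}(x \mid \theta)$ is the true score corresponding to x.
Since $P_i$ is ordinarily a strictly increasing function of $\theta$, as will
be assumed here, we have from (6) that $\xi$ is also a strictly monotonic
transformation of $\theta$. From (3) we then have that the score information
function of x for $\xi$ is

$$I\{\xi, x\} = I\{\theta, x\}\,(\partial\xi/\partial\theta)^{-2}. \tag{8}$$

Finally, from (1) and (8),

$$I\{\xi, x\} = 1/\operatorname{Var}(x \mid \xi). \tag{9}$$

(The numerator here is 1 because the regression of observed score on true
score has unit slope.) If y measures the same trait as x, and
$\eta \equiv \mathcal{E}(y \mid \theta)$ denotes the true score for y, we have similarly

$$I\{\eta, y\} = 1/\operatorname{Var}(y \mid \eta).$$

Since $\xi$ and $\eta$ are both strictly monotonic transformations of $\theta$,
it follows that $\eta \equiv \eta(\xi)$ is a strictly monotonic transformation of $\xi$.
Thus we can use (3) to write down the score information function of y
for $\xi$:

$$I\{\xi, y\} = I\{\eta, y\}\,(\partial\xi/\partial\eta)^{-2} = \frac{(\partial\eta/\partial\xi)^2}{\operatorname{Var}(y \mid \eta)}. \tag{10}$$

The efficiency of y relative to x is now the ratio of (10) to (8):

$$\text{R.E.}\{y, x\} = \left(\frac{\partial\eta}{\partial\xi}\right)^2 \frac{\operatorname{Var}(x \mid \xi)}{\operatorname{Var}(y \mid \eta)}. \tag{11}$$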



Similarly, 



The function $\eta(\xi)$ can be defined by the relation

$$\int_{-\infty}^{\xi_0} p(\xi)\,d\xi = \int_{-\infty}^{\eta_0} q(\eta)\,d\eta, \tag{13}$$

where $p(\xi)$ and $q(\eta)$ are the probability density functions for $\xi$
and $\eta$. Equation (13) simply states that for any population, the proportion
of cases lying below $\xi_0$ must be the same as the proportion lying
below $\eta_0 \equiv \eta(\xi_0)$. This must be true, for $\xi$ and $\eta$ are simply two different
ways of expressing the individual's standing on a single psychological
dimension.

Since (13) holds for all $\xi_0$, we can differentiate both sides to
obtain $p(\xi_0) = q(\eta_0)\,\partial\eta_0/\partial\xi_0$. Dropping the subscript and rearranging
gives

$$\frac{\partial\eta}{\partial\xi} = \frac{p(\xi)}{q(\eta)}. \tag{14}$$

Substituting (14) in (11) gives an interesting expression for the relative
efficiency in terms of frequency distributions of true scores:

$$\text{R.E.}\{y, x\} = \frac{\operatorname{Var}(x \mid \xi)}{\operatorname{Var}(y \mid \eta)} \cdot \frac{p^2(\xi)}{q^2(\eta)}, \tag{15}$$

where $\eta \equiv \eta(\xi)$ is the equipercentile equivalent of $\xi$, as required by
(13).
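The equal-proportions relation that defines the equipercentile equivalent can be solved numerically once the two true-score densities are known. The sketch below is illustrative only and is not the program cited later in the paper: the two densities (uniform for the first true score, triangular for the second, both on the interval 0 to 1), the grid, and every function name are assumptions chosen so that the exact answer, the square root of the first true score, is known in advance.

```python
from bisect import bisect_left

def cumulative(density, grid):
    """Trapezoid-rule cumulative distribution of `density` over `grid`, scaled to end at 1."""
    cdf = [0.0]
    for lo, hi in zip(grid, grid[1:]):
        cdf.append(cdf[-1] + 0.5 * (density(lo) + density(hi)) * (hi - lo))
    total = cdf[-1]
    return [c / total for c in cdf]

def interpolate(x, xs, ys):
    """Linear interpolation of y at x, for strictly increasing xs."""
    j = min(max(bisect_left(xs, x), 1), len(xs) - 1)
    weight = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + weight * (ys[j] - ys[j - 1])

def equipercentile(xi, p, q, xi_grid, eta_grid):
    """Value of eta with the same proportion of cases below it as lies below xi."""
    cdf_p = cumulative(p, xi_grid)
    cdf_q = cumulative(q, eta_grid)
    level = interpolate(xi, xi_grid, cdf_p)     # proportion of cases below xi
    return interpolate(level, cdf_q, eta_grid)  # invert the second distribution

def p(xi):
    return 1.0          # assumed density of the first true score on [0, 1]

def q(eta):
    return 2.0 * eta    # assumed density of the second true score on [0, 1]

grid = [i / 1000 for i in range(1001)]

eta = equipercentile(0.49, p, q, grid, grid)
# Finite-difference slope of eta(xi), to compare with the density ratio p/q.
slope = (equipercentile(0.50, p, q, grid, grid)
         - equipercentile(0.48, p, q, grid, grid)) / 0.02
density_ratio = p(0.49) / q(eta)

print(eta, slope, density_ratio)
```

With these assumed densities the slope of the equipercentile function and the ratio of the two densities should agree closely, which is the relation used to pass from the derivative form of the relative efficiency to the form based on frequency distributions.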

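If x is a number-right score, the range of $\xi$ is 0 to $n_x$, where
$n_x$ is the number of items in test x, and similarly for $\eta$. It may be
desirable to rewrite (15) in terms of $\zeta \equiv \xi/n_x$, $z \equiv x/n_x$, $\omega \equiv \eta/n_y$,
and $w \equiv y/n_y$:

$$\text{R.E.}\{y, x\} = \frac{\operatorname{Var}(z \mid \zeta)}{\operatorname{Var}(w \mid \omega)} \cdot \frac{g^2(\zeta)}{h^2(\omega)}, \tag{16}$$

where g and h are the density functions for $\zeta$ and $\omega$.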
To our surprise, these formulas show that the relative efficiency of
two tests can be expressed directly in terms of true-score frequency distributions
and standard errors of measurement. The formulas agree with
the vague intuitive notion that a test is more discriminating at true-score
levels where the scores are spread out, less discriminating at true-score
levels where scores pile up.
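The agreement between the true-score form of the relative efficiency and the ratio of score information functions from (7) can be checked numerically. The following sketch rests on assumptions that are not part of the paper: logistic item characteristic functions, hypothetical item parameters, and a standard normal ability distribution. The true-score densities are obtained by the usual change of variable from the ability density.

```python
import math

def icc(theta, a, b):
    """Logistic item characteristic function (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def slope_and_variance(theta, items):
    """Slope of the true score in theta, and error variance, for a number-right score."""
    slope = 0.0
    variance = 0.0
    for a, b in items:
        prob = icc(theta, a, b)
        slope += a * prob * (1.0 - prob)   # derivative of a logistic curve
        variance += prob * (1.0 - prob)
    return slope, variance

def normal_density(theta):
    """Standard normal ability density (an assumption of this sketch)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def re_from_information(theta, x_items, y_items):
    """Efficiency of y relative to x as a ratio of score information functions."""
    slope_x, var_x = slope_and_variance(theta, x_items)
    slope_y, var_y = slope_and_variance(theta, y_items)
    return (slope_y ** 2 / var_y) / (slope_x ** 2 / var_x)

def re_from_true_score_densities(theta, x_items, y_items, ability_density):
    """Efficiency of y relative to x from error variances and true-score densities."""
    slope_x, var_x = slope_and_variance(theta, x_items)
    slope_y, var_y = slope_and_variance(theta, y_items)
    p = ability_density(theta) / slope_x   # density of the true score of x
    q = ability_density(theta) / slope_y   # density of the true score of y
    return (var_x / var_y) * (p / q) ** 2

# Hypothetical item parameters (a, b).
x_items = [(1.0, 0.0)] * 10
y_items = [(1.0, -2.0 + 0.4 * i) for i in range(10)]

for theta in (-2.0, 0.0, 1.5):
    old = re_from_information(theta, x_items, y_items)
    new = re_from_true_score_densities(theta, x_items, y_items, normal_density)
    print(theta, old, new)
```

The two columns should coincide at every ability level. The ability density cancels out of the ratio, which is why the true-score form needs no assumption about the ability distribution beyond the two tests being given to equivalent groups.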

3. Practical Application 

Various convenient ways of estimating the expression on the right of
(16) will be found. The crude but simple procedure of substituting sample
distributions of observed scores for $p(\xi)$ and $q(\eta)$ will be discussed in
another publication. Here we discuss a particular estimation procedure
available when x and y are number-right scores. Although this procedure
is complicated, it is an order of magnitude simpler than estimating
accurately all the item parameters required by (7). In large samples, the
new procedure seems to yield results that are much the same for most
practical purposes.

The functions $g(\zeta)$ and $h(\omega)$ needed for (16) are estimated from
the sample frequency distributions of x and y by methods discussed by
Lord [1969], using a revised version, available from the author, of the
computer program described by Wingersky, Lees, Lennon, and Lord [1969].
The functions $\operatorname{Var}(z \mid \zeta)$ needed for (16) are approximated by the formulas
[Lord, 1965, eqs. 9, 34]

$$k_x \equiv \tfrac{1}{2}\, n_x^2 (n_x - 1)\, s_p^2 \big/ \left[\bar{x}(n_x - \bar{x}) - s_x^2 - n_x s_p^2\right], \tag{17}$$

$$\operatorname{Var}(z \mid \zeta) = (n_x - 2k_x)\,\zeta(1 - \zeta)/n_x^2, \tag{18}$$

where $\bar{x}$ and $s_x^2$ are the sample mean and variance (over people) of the
number-right scores, $s_p^2$ is the sample variance (over items) of the $p_i$,
and $p_i$ is the sample proportion of correct answers to item i.
$\operatorname{Var}(w \mid \omega)$ is obtained similarly.
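The approximation to the conditional error variance can be computed directly from a matrix of scored responses. The sketch below is a hedged illustration and not the author's program: it assumes the constant k carries a factor of one half, that both sample variances use the divisor N rather than N - 1, and that every examinee answered every item. The response matrices and function names are hypothetical.

```python
from statistics import mean, pvariance

def k_constant(responses):
    """Number of items n and the constant k for a 0/1 matrix (rows = people, columns = items)."""
    n = len(responses[0])
    scores = [sum(row) for row in responses]           # number-right scores
    x_bar = mean(scores)
    s2_x = pvariance(scores)                           # variance over people
    p = [mean(column) for column in zip(*responses)]   # proportion correct per item
    s2_p = pvariance(p)                                # variance over items
    denominator = x_bar * (n - x_bar) - s2_x - n * s2_p
    return n, 0.5 * n ** 2 * (n - 1) * s2_p / denominator

def conditional_variance(zeta, n, k):
    """Approximate error variance of the proportion-correct score at true score zeta."""
    return (n - 2.0 * k) * zeta * (1.0 - zeta) / n ** 2

# Hypothetical data: four items of unequal difficulty, six examinees.
unequal = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
# Hypothetical data: four items of equal difficulty, four examinees.
equal = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

n_u, k_u = k_constant(unequal)
n_e, k_e = k_constant(equal)
print(k_u, conditional_variance(0.5, n_u, k_u))
print(k_e, conditional_variance(0.5, n_e, k_e))
```

When all items are equally difficult the item variance is zero, so k is zero and the formula reduces to the binomial error variance. Unequal item difficulties give a positive k and therefore a smaller error variance at the same true score.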

The relation between $\eta$ and $\xi$, symbolized by the notation $\eta \equiv \eta(\xi)$,
is calculated numerically by a computer program [Stocking, Lees, Lennon,
& Lord, 1969] that solves (13) for $\eta$. A revised version of this program,
available from the author, also computes relative efficiencies from (16)
and plots them as a function of $\xi$. No item characteristic curve parameters
are used anywhere in this method.

This new method has been tested
out and compared to the old method using 90 verbal items from the
Scholastic Aptitude Test. For the old method, item parameters for all
90 items were estimated simultaneously from 2926 answer sheets by the
maximum likelihood method described by Lord [1973]. Random responses were
supplied for omitted responses (but not for items apparently "not reached"
by the examinee). A 45-item "peaked" subtest was selected consisting of
those items having estimated difficulty parameters near the average value
for the entire test. A "regular" subtest consisted of the 45 even-numbered
items. Actually, there was considerable overlap in items between the two
subtests. However, the formulas used and the conclusions reached are
appropriate for two nonoverlapping 45-item tests having the same item
parameters as the actual tests, with all examinees responding to all items.

Estimated score information functions were computed from the estimated
item parameters by (7). The dashed curve in Figure 1 shows the ratio of
these information functions, estimating the efficiency of the regular
test relative to the peaked test. A logarithmic scale is used for relative
efficiency, since an R.E. of .5 is precisely as noteworthy as an R.E. of
2.0.
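The comparison of a peaked test with a regular test by the ratio of score information functions can be sketched as follows. This is an illustration under assumptions, not a reproduction of the analysis reported here: the items are given logistic characteristic functions with equal slopes, the peaked test places all 45 difficulties at zero, the regular test spreads them evenly from -2 to 2, and the function names are hypothetical.

```python
import math

def icc(theta, a, b):
    """Logistic item characteristic function (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_information(theta, items):
    """Score information of a number-right score: squared summed slopes over summed P*Q."""
    slope = 0.0
    variance = 0.0
    for a, b in items:
        prob = icc(theta, a, b)
        slope += a * prob * (1.0 - prob)
        variance += prob * (1.0 - prob)
    return slope ** 2 / variance

def relative_efficiency(theta, y_items, x_items):
    """Efficiency of test y relative to test x at ability theta."""
    return score_information(theta, y_items) / score_information(theta, x_items)

# Hypothetical 45-item tests.
peaked = [(1.0, 0.0)] * 45
regular = [(1.0, -2.0 + 4.0 * i / 44) for i in range(45)]

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    re = relative_efficiency(theta, regular, peaked)
    # The logarithm puts R.E. = .5 and R.E. = 2.0 at equal distances from zero.
    print(theta, re, math.log(re))
```

Under these assumptions the regular test should be less efficient than the peaked test near the common difficulty level and more efficient at extreme ability levels, so the logarithm of the relative efficiency changes sign along the ability scale.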

The solid curve in Figure 1 was obtained by the new method (16), as
described in the first paragraphs of this section. The necessary sample
distributions for x and y were obtained simply by scoring the 45-item
subsets involved. Only the 1805 examinees who finished the test were
used for these calculations. The solid curve shows some tendency to
oscillate about the first curve, but in general seems to provide a satisfactory
and very usable approximation. The oscillations could presumably
be avoided by using larger samples or by other means. The computational
cost of estimating (16) does not increase with sample size.




Figure 1. Estimate of relative efficiency from (16) compared with
estimate from (7) and (5).